The Bioinformatics Resource Center ATAC-seq pipeline provides QC and processed files for use in secondary analysis systems.

This report was written by Doug Barrows and Thomas Carroll at the Rockefeller University Bioinformatics Resource Center with contributions by Elitsa Stoyanova and Hasnahana Chetia from the Heintz laboratory.

Input Files

Input files required by the pipeline (links contain information about each file format):

Sample data

Reference data

Input_Files Name
Sample Name Sample_2
All_FASTQ 2_S2_R1_001.fastq.gz;2_S2_R2_001.fastq.gz 2_S2_R1_001_copy1.fastq.gz;2_S2_R2_001_copy1.fastq.gz
Input_Files Name
Genome hg38-UCSC
Reference Fasta BSgenome.Hsapiens.UCSC.hg38.fa
GTF TxDb.Hsapiens.UCSC.hg38.knownGene.gtf.gz
ID to Symbol map org.Hs.eg_ID2Symbol.txt

Output files

Quality control files summarized in this report (links navigate to appropriate section of the report):

Processed files generated by the pipeline (links contain information about each file format):

Output_Files File_Name
Fastp report Sample_2_pairs1.fq.gz.json
Picard AlignmentSummaryMetrics Sorted_Sample_2.AlignmentSummaryMetrics.txt
ATACseqQC summary Sorted_Sample_2ATACseqQC.json
ChIPQC summary Sample_2.RData
BAM file Sorted_Sample_2.bam
bigWig file Sorted_Sample_2Normalised.bw
MACs peak calls Sample_2_withInput_Sample_2_peaks.narrowPeak
Reads/Counts in Peaks Sorted_Sample_2_subread_Peak_Antibody_ATAC_Consensus_fromMACSisBlacklisted.Counts
Reads/Counts in TSSs Sorted_Sample_2_subread_Peak_TSS_PositionsisBlacklisted.Counts

FASTQ Read quality assessment with fastp

Return to top

Read quality is assessed using the BRC’s Rfastp R package which provides an R interface for the fastp and genocore C libraries.

In contrast to quality control through FastQC, Rfastp will calculate statistics across the complete FQ file instead of a subsample of reads. Additionally Rfastp includes assessment for read pairs instead of disjointed QC for each file of a pair.

Rfastp captures metrics on:

  • Read quality
  • Duplication rate
  • Insert size
  • Base frequency of cycles
  • Base quality over cycles
  • Kmer frequency

Summary of FASTQ quality control

Fastp assesses each read based on a number of quality metrics, and determines whether that read meets a quality filtering threshold. Reads did not pass the quality filter if at least one of the following are true:

  • The read has at least 40% of bases with a quality phred score of less than 15
  • The read had more than 5 ‘N’ bases (could not be assigned A,G,C, or T)
  • The read was shorter than 15 bases.

The first table shows the quality filtering results while the second table shows specific quality metrics for all of the reads compared to those that passed filtering. The quality scores (i.e. phred score) for each base (e.g. q20 and q30) in the table below reflect the likelihood that the assignment of that base is incorrect.

Fastp quality filtering results (Hover over the names of for a description)
Metric Number_of_reads
passed_filter_reads 52,671,092
failed_filter_reads 1,364,128
low_quality_reads 1,255,894
too_many_N_reads 5,848
too_short_reads 102,386
Metrics for the FASTQ reads before and after quality filtering
Metric Before_Filtering After_Filtering
total_reads 54,035,220 52,671,092
total_bases 4,106,676,720 3,757,697,827
q20_bases 3,958,677,856 3,655,769,311
q20_rate 0.964 0.9729
q30_bases 3,905,535,209 3,614,638,309
q30_rate 0.951 0.9619
read1_mean_length 76 71
read2_mean_length 76 71
gc_content 0.5102 0.5098

Duplication in FASTQ

Metric Value
Duplication Rate 0.0798165


Read Overlap Insert size distribution

The insert size represents the length of the original nucleotide fragment (not including adapters) and is estimated based on the overlap of paired-end reads.

Read base content

For each sequencing cycle (separated by read 1 or read 2 for paired-end), which corresponds to each position in the read, the proportion for each nucleotide (or combination of G and C) across all reads is shown.

Base Quality

For each sequencing cycle (separated by read 1 or read 2 for paired-end), which corresponds to each position in the read, the average phred quality score is shown. The quality score is equal to -10*log10(P), with P being the probability that a base is incorrect. So if the quality score is 30, this means there is a 0.1% chance (P = 0.001) that the base is incorrect.

Kmer content

The most common Kmers (of length 5) are shown below. Hover over each bar to see the exact value for that Kmer.

Alignment - BAM file Metrics

Alignment Quality Control using Picard AlignmentSummaryMetrics

Return to top

Picard and Java are included into the pipeline through installation by the BRC’s CondaSysReqs package. CondaSysReqs installs external software requirements in a dedicated Conda environment versions alongside the pipeline version.

The pipeline generates the AlignmentSummaryMetrics Picard quality control software.

Various Picard metrics related to the alignment are shown. To get a description of each metric, hover over the name or look at the table of descriptions below

Metric FIRST_OF_PAIR SECOND_OF_PAIR
TOTAL_READS 27,017,610 27,017,610
PF_READS 27,017,610 27,017,610
PCT_PF_READS 1 1
PF_NOISE_READS 0 0
PF_READS_ALIGNED 25,587,038 25,345,455
PCT_PF_READS_ALIGNED 0.947 0.9381
PF_HQ_ALIGNED_READS 23,091,780 22,743,989
MEAN_READ_LENGTH 76 76
READS_ALIGNED_IN_PAIRS 24,882,389 24,882,389
PCT_READS_ALIGNED_IN_PAIRS 0.9725 0.9817
PF_READS_IMPROPER_PAIRS 2,670,939 2,429,356
PCT_PF_READS_IMPROPER_PAIRS 0.1044 0.0959
BAD_CYCLES 0 0
STRAND_BALANCE 0.4985 0.5031
PCT_CHIMERAS 0.041 0.0416
PCT_ADAPTER 0 0

ATAC-seq QC Metrics

Return to top

ATAC-seq specific quality metrics are retrieved using the ATACseqQC R/Bioconductor package.

These provide additional insight into the quality of the ATAC-seq data given expected features of ATAC-seq.

For samples with larger number of reads (> 40M reads) we downsample reads while maintaining paired-end read information to 40 million reads.

General Metrics

The first table provides annotation independent assessment of ATAC-seq quality including the rate of reads mapping in proper pairs (paired reads within defined distance of each other), the rate of reads mapping to the mitochondria, the fraction of non-duplicate reads and the two pcr bottle neck coefficients (PBC1/PBC2). For more information on these metrics, hover over the terms or see the Glossary at the bottom of page.

Metric Value
Proper_Pair_Rate 0.8999
Mapped_To_Mitochondria_Rate 0.0661
Non_Redundant_Fraction 0.8605
PBC_1 0.9425
PBC_2 20.1489

Transcription Start Site (TSS) Enrichment Score

The below table contains the annotation specific ATAC-seq quality metric, Transcription Start Site (TSS) Enrichment Score or TSSE. This scores assesses the enrichment of signal within a TSS compared to the surrounding area. For more information on TSSE scores, please see the Glossary at the bottom of page.

Metric TSSE_Score
TSSE 38.6486

Insert Size Lengths

Return to top

The plot and table below show the the insert sizes for all aligned fragments and highlights the nucleosome free, mono-nucleosome and di-nucleosome proportions of the data. The presense of clear bump in the mono-nucleosome fractions indicates a higher quality ATAC-seq dataset.

ATAC_Fraction Proportion_Of_Mapped
Nucleosome Free 0.2492
Mono-Nucleosome 0.1818
Di-Nucleosome 0.1438

Peak quality metrics with ChIPQC

Return to top

ATAC-seq peaks

For ATAC-seq open regions are detected using MACS2.

For paired-end ATAC-seq data, multimapping and non-properly paired reads are removed using Rsamtools libraries and resulting BAM file used to call MACs peaks with default parameters in BAMPE mode.

In the below table we assess the total number of peaks and the fraction of reads mapping to peaks (FRIP).

Metric Value
Total number of peaks 39,304
Fraction of reads in peaks (FRIP) 0.4755

Fraction of reads in genomic annotation

We further assess the distribution of signal across genomic annotation as shown in the below table. As these regions are non-exclusive and are overlapping, the signal in overlapping regions will be counted in both regions and so the sum of table and reads in intergenic regions will not add up to 1.

Annotation Metric
LongPromoter20000to2000 0.1273
Promoters2000to500 0.011
Promoters500 0.0037
All5utrs 0.0034
Alltranscripts 0.4842
Allcds 0.0117
Allintrons 0.4628
All3utrs 0.0141

Glossary of Terms

Key terms with Course Links

Return to top

Description of Picard AlignmentSummaryMetrics

Return to Alignment metric table

Description of ATACseqQC Metrics

Return to ATACseqQC metric table

Software used

Packages outside of R

Package Citation
Picard Tools https://broadinstitute.github.io/picard/

R packages

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux Server 7.5 (Maipo)
## 
## Matrix products: default
## BLAS: /rugpfs/fs0/ruit/scratch/tcarroll/R-3.5.1/lib/libRblas.so
## LAPACK: /rugpfs/fs0/ruit/scratch/tcarroll/R-3.5.1/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2              rio_0.5.16                 
##  [3] GlossyBRC_0.1.0             plotly_4.8.0               
##  [5] kableExtra_1.2.1            magrittr_1.5               
##  [7] forcats_0.3.0               stringr_1.3.1              
##  [9] dplyr_0.7.6                 purrr_0.2.5                
## [11] readr_1.3.1                 tidyr_0.8.1                
## [13] tibble_1.4.2                tidyverse_1.2.1            
## [15] cowplot_0.9.3               DT_0.4                     
## [17] ChIPQC_1.16.2               DiffBind_2.8.0             
## [19] SummarizedExperiment_1.10.1 DelayedArray_0.6.6         
## [21] BiocParallel_1.14.2         matrixStats_0.54.0         
## [23] Biobase_2.40.0              GenomicRanges_1.32.7       
## [25] GenomeInfoDb_1.16.0         IRanges_2.14.12            
## [27] S4Vectors_0.18.3            BiocGenerics_0.26.0        
## [29] ggplot2_3.0.0               jsonlite_1.5               
## [31] rjson_0.2.20                rmarkdown_1.10             
## 
## loaded via a namespace (and not attached):
##   [1] sendmailR_1.2-1                          
##   [2] SparseM_1.77                             
##   [3] rtracklayer_1.40.6                       
##   [4] AnnotationForge_1.22.2                   
##   [5] prabclus_2.2-6                           
##   [6] acepack_1.4.1                            
##   [7] bit64_0.9-7                              
##   [8] knitr_1.20                               
##   [9] data.table_1.11.4                        
##  [10] rpart_4.1-13                             
##  [11] hwriter_1.3.2                            
##  [12] RCurl_1.95-4.11                          
##  [13] GenomicFeatures_1.34.4                   
##  [14] TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2  
##  [15] RSQLite_2.1.1                            
##  [16] BSgenome.Hsapiens.UCSC.hg19_1.4.0        
##  [17] bit_1.1-14                               
##  [18] webshot_0.5.1                            
##  [19] xml2_1.2.0                               
##  [20] lubridate_1.7.4                          
##  [21] httpuv_1.4.5                             
##  [22] assertthat_0.2.0                         
##  [23] viridis_0.5.1                            
##  [24] amap_0.8-16                              
##  [25] tximport_1.8.0                           
##  [26] hms_0.4.2                                
##  [27] promises_1.0.1                           
##  [28] evaluate_0.11                            
##  [29] TSP_1.1-6                                
##  [30] DEoptimR_1.0-8                           
##  [31] progress_1.2.0                           
##  [32] caTools_1.17.1.1                         
##  [33] dendextend_1.8.0                         
##  [34] readxl_1.1.0                             
##  [35] Rgraphviz_2.24.0                         
##  [36] DBI_1.0.0                                
##  [37] geneplotter_1.58.0                       
##  [38] htmlwidgets_1.2                          
##  [39] ngsPipeR_0.9.431                         
##  [40] crosstalk_1.0.0                          
##  [41] backports_1.1.2                          
##  [42] BatchJobs_1.7                            
##  [43] trimcluster_0.1-2.1                      
##  [44] annotate_1.58.0                          
##  [45] biomaRt_2.36.1                           
##  [46] prettydoc_0.2.1                          
##  [47] Rfastp_0.1.2                             
##  [48] withr_2.1.2                              
##  [49] BSgenome_1.48.0                          
##  [50] robustbase_0.93-2                        
##  [51] checkmate_1.8.5                          
##  [52] GenomicAlignments_1.16.0                 
##  [53] gclus_1.3.1                              
##  [54] prettyunits_1.0.2                        
##  [55] mclust_5.4.1                             
##  [56] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0 
##  [57] cluster_2.0.7-1                          
##  [58] lazyeval_0.2.1                           
##  [59] crayon_1.3.4                             
##  [60] genefilter_1.62.0                        
##  [61] edgeR_3.22.3                             
##  [62] pkgconfig_2.0.2                          
##  [63] labeling_0.3                             
##  [64] nlme_3.1-137                             
##  [65] seriation_1.2-3                          
##  [66] nnet_7.3-12                              
##  [67] bindr_0.1.1                              
##  [68] rlang_0.2.2                              
##  [69] diptest_0.75-7                           
##  [70] registry_0.5                             
##  [71] GOstats_2.46.0                           
##  [72] modelr_0.1.2                             
##  [73] cellranger_1.1.0                         
##  [74] rprojroot_1.3-2                          
##  [75] graph_1.58.0                             
##  [76] Matrix_1.2-14                            
##  [77] chipseq_1.30.0                           
##  [78] Rsubread_1.30.6                          
##  [79] base64enc_0.1-3                          
##  [80] whisker_0.3-2                            
##  [81] pheatmap_1.0.10                          
##  [82] viridisLite_0.3.0                        
##  [83] BSgenome.Mmusculus.UCSC.mm10_1.4.0       
##  [84] bitops_1.0-6                             
##  [85] KernSmooth_2.23-15                       
##  [86] Biostrings_2.48.0                        
##  [87] blob_1.1.1                               
##  [88] ShortRead_1.38.0                         
##  [89] brew_1.0-6                               
##  [90] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2  
##  [91] scales_1.0.0                             
##  [92] memoise_1.1.0                            
##  [93] GSEABase_1.42.0                          
##  [94] plyr_1.8.4                               
##  [95] gplots_3.0.1                             
##  [96] gdata_2.18.0                             
##  [97] zlibbioc_1.26.0                          
##  [98] compiler_3.5.1                           
##  [99] RColorBrewer_1.1-2                       
## [100] DESeq2_1.20.0                            
## [101] Rsamtools_1.32.3                         
## [102] cli_1.0.0                                
## [103] systemPipeR_1.14.0                       
## [104] XVector_0.20.0                           
## [105] Category_2.46.0                          
## [106] htmlTable_1.12                           
## [107] Formula_1.2-3                            
## [108] MASS_7.3-50                              
## [109] tidyselect_0.2.4                         
## [110] widgetframe_0.3.1                        
## [111] stringi_1.2.4                            
## [112] yaml_2.2.0                               
## [113] locfit_1.5-9.1                           
## [114] latticeExtra_0.6-28                      
## [115] ggrepel_0.8.0                            
## [116] grid_3.5.1                               
## [117] tools_3.5.1                              
## [118] rstudioapi_0.7                           
## [119] foreach_1.4.4                            
## [120] foreign_0.8-71                           
## [121] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  
## [122] DEXSeq_1.26.0                            
## [123] gridExtra_2.3                            
## [124] digest_0.6.17                            
## [125] shiny_1.1.0                              
## [126] fpc_2.1-11.1                             
## [127] Rcpp_1.0.4.6                             
## [128] broom_0.5.0                              
## [129] later_0.7.4                              
## [130] org.Hs.eg.db_3.6.0                       
## [131] httr_1.4.1                               
## [132] Nozzle.R1_1.1-1                          
## [133] AnnotationDbi_1.44.0                     
## [134] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2     
## [135] kernlab_0.9-27                           
## [136] colorspace_1.3-2                         
## [137] rvest_0.3.2                              
## [138] XML_3.98-1.16                            
## [139] topGO_2.32.0                             
## [140] splines_3.5.1                            
## [141] RBGL_1.56.0                              
## [142] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
## [143] statmod_1.4.30                           
## [144] flexmix_2.3-14                           
## [145] xtable_1.8-3                             
## [146] BBmisc_1.11                              
## [147] heatmaply_0.15.2                         
## [148] modeltools_0.2-22                        
## [149] R6_2.2.2                                 
## [150] Hmisc_4.1-1                              
## [151] mime_0.5                                 
## [152] pillar_1.3.0                             
## [153] htmltools_0.3.6                          
## [154] glue_1.3.0                               
## [155] class_7.3-14                             
## [156] codetools_0.2-15                         
## [157] mvtnorm_1.1-0                            
## [158] lattice_0.20-35                          
## [159] curl_3.2                                 
## [160] gtools_3.8.1                             
## [161] zip_2.0.4                                
## [162] GO.db_3.6.0                              
## [163] openxlsx_4.1.0.1                         
## [164] survival_2.42-6                          
## [165] limma_3.36.3                             
## [166] munsell_0.5.0                            
## [167] GenomeInfoDbData_1.1.0                   
## [168] iterators_1.0.10                         
## [169] haven_2.0.0                              
## [170] reshape2_1.4.3                           
## [171] gtable_0.2.0                             
## [172] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2